A workflow for deriving chemical entities from crystallographic data and its application to the Crystallography Open Database

Knowledge about the three-dimensional structure, orientation and interaction of chemical compounds is important in many areas of science and technology. X-ray crystallography is one of the experimental techniques capable of providing a large amount of structural information for a given compound, and it is widely used for the characterisation of organic and metal-organic molecules. The method provides precise 3D coordinates of atoms inside crystals; however, it does not directly deliver information about certain chemical characteristics such as bond orders, delocalization, charges, lone electron pairs or lone electrons. These aspects of a molecular model have to be derived from crystallographic data using refined information about interatomic distances and atom types as well as general chemical knowledge. This publication describes a curated automatic pipeline for the derivation of chemical attributes of molecules from crystallographic models. The method is applied to build a catalogue of chemical entities in an open-access crystallographic database, the Crystallography Open Database (COD). The catalogue of such chemical entities is provided openly as a derived database. The content of this catalogue and the problems arising in the fully automated pipeline are discussed, along with the possibilities of introducing manual data curation into the process. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00780-2.


S1 Additional COD resources used for entry validity determination
Chemical descriptions generated from COD entries must satisfy certain quality criteria to be suitable for cross-linking with external resources. Only entries that pass all of the data quality tests are marked as valid and are allowed to remain in the final dataset. Some of the tests require the following additional summary files and local databases:
cif-formulae-mismatch.tsv. A tab-separated value file which lists COD entries for which the chemical formula declared in the original input CIF file differs from the chemical formula calculated from the corresponding stoichiometric CIF file. Such mismatches may be used to detect discrepancies in the input crystal structures such as missing atoms, incorrectly marked crystal symmetry or incorrect chemical formulae.
COD.sqlite3. An SQLite database which contains various properties of the input crystal structures such as cell constants, chemical formulae, experimental conditions, data provenance information as well as descriptions introduced by the COD database maintainers (e.g. the duplicate structure flag).
disorder-in-cod.sqlite3. An SQLite database which contains information about explicitly marked disorder in the input crystal structures.
These resources are generated from the same fixed COD revision as the one used to produce the final chemical descriptions.

S2 Selection of a representative conformation of a disordered molecule
The cif_molecule program selects a representative conformation of a molecule in a disordered crystal structure by identifying the optimal combination of disordered group positions based on the following rules of decreasing priority:
Groups with higher occupancies are preferred. In the scope of this algorithm, an entire disordered group is assigned the highest occupancy observed among its constituent atom sites.
Groups with a higher number of disordered atom sites are preferred.
Groups with a lexicographically lesser name (unique identifier) are preferred.
The described approach produces satisfactory results with most positionally disordered structures but does not comprehensively represent compositionally disordered structures (see the "Handling of crystallographically disordered structures" section of the main publication).
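The three priority rules can be illustrated with a short Python sketch. It is not the cif_molecule implementation: the DisorderGroup class and the function names are hypothetical, and the sketch only ranks alternative groups against each other, without modelling how combinations of groups across several disorder assemblies are explored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisorderGroup:
    """Hypothetical container for one disorder group."""
    name: str                  # unique identifier of the group
    occupancies: List[float]   # occupancies of the constituent atom sites


def group_rank_key(group: DisorderGroup):
    """Sort key implementing the three rules in decreasing priority."""
    return (
        -max(group.occupancies),   # rule 1: highest observed occupancy first
        -len(group.occupancies),   # rule 2: more disordered atom sites first
        group.name,                # rule 3: lexicographically lesser name first
    )


def pick_preferred(groups: List[DisorderGroup]) -> DisorderGroup:
    """Return the group preferred by the ranking rules."""
    return min(groups, key=group_rank_key)


major = DisorderGroup("B", [0.6, 0.6])
minor = DisorderGroup("A", [0.4, 0.4])
preferred = pick_preferred([minor, major])
```

Here the group named "B" is preferred despite its name, because occupancy takes priority over the name-based tie-breaker.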

S3 Derivation of the interatomic bond length distribution set
The interatomic bond length distribution set used by the cif-perceive-chemistry program was derived from open data using the following algorithm:
1. Calculate chemical structures of all COD entries using the Open Babel software suite by converting the stoichiometric CIF files to SDF files.
2. Determine the set of trustworthy chemical structures. A chemical structure is considered trustworthy if it matches the structure represented by a SMILES string from an expert-curated COD SMILES dataset [1]. The comparison of chemical structures is carried out by converting both structures to canonical SMILES and comparing them as case-sensitive text strings.
3. Derive the bond length distribution dataset based on the set of trustworthy chemical structures identified in step 2.
4. Recalculate chemical structures of all COD entries, this time using the cif-perceive-chemistry program with the newly derived bond length distribution dataset.
5. Determine the set of trustworthy chemical structures. This time a chemical structure is considered trustworthy if it does not contain significant deviations from previously observed molecular geometry and does not contain obvious errors such as an unbalanced molecular charge.
6. Derive the bond length distribution dataset based on the set of trustworthy chemical structures identified in step 5.
Subsequent versions of the bond length distribution set (e.g., derived from future revisions of the COD database that may contain more entries) can be calculated by repeating steps 4-6.
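The derivation steps (3 and 6) amount to summarising the bond lengths observed for each bond class in the trustworthy structures. The following Python sketch shows one way this could be done. The function name and input layout are hypothetical, and using the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def derive_distributions(observations):
    """Summarise (bond_class, length) observations per bond class.

    Returns {bond_class: (mean, standard deviation, count)}.
    The sample standard deviation is an assumption of this sketch.
    """
    by_class = defaultdict(list)
    for bond_class, length in observations:
        by_class[bond_class].append(length)

    result = {}
    for bond_class, lengths in by_class.items():
        # stdev() needs at least two observations; fall back to 0.0 otherwise
        spread = stdev(lengths) if len(lengths) > 1 else 0.0
        result[bond_class] = (mean(lengths), spread, len(lengths))
    return result


# Illustrative placeholder observations, not values taken from the COD
distributions = derive_distributions([
    ("class-A", 1.50),
    ("class-A", 1.60),
    ("class-B", 1.20),
])
```

Repeating steps 4-6 then corresponds to recomputing the observations with the updated distributions and calling such a function again.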

S4 On aromatic and delocalized bonds
The data model of a molecule used by the OpenChemLib framework includes calculated atom and bond properties that may not have been explicitly specified in the input data and were instead determined using specific rules. Bond aromaticity and bond delocalization are two such properties that are used by the cif-perceive-chemistry program when assigning and validating a chemical structure. A bond is considered aromatic if it is in a ring and the ring satisfies Hückel's rule. Similarly, a bond is considered delocalized if it is aromatic and the ring does not have a preferred mesomeric state. For example, all bonds in a pyridine ring are recognised as delocalized, while all bonds in a thiophene ring are recognised as aromatic, but not as delocalized. SDF files produced by the chemical perception pipeline do not explicitly identify aromatic or delocalized bonds in any way, while the DWAR file retains this information using machine-readable idcode text strings that encode molecular structures in a canonical and compact way.

S5 Description of the bond length distribution set
Bonds in the bond length distribution set used by OpenChemLib and cif-perceive-chemistry are classified using the following bond properties:
Bond order. A formally assigned enumeration state which classifies the bond as a single, double, or triple bond.
Bond aromaticity. A flag value which signifies if the bond was recognised as aromatic based on the rules specified in Section S4.
Bond delocalization. A flag value which signifies if the bond was recognised as delocalized based on the rules specified in Section S4.
Atomic numbers of the bonded atoms. The atomic numbers of the bonded atoms as specified in the periodic table of chemical elements.
π electron counts of the bonded atoms. The overall π electron count of an atom is calculated by summing up the π electron counts of each bond that the atom participates in, with the assumption that a double bond requires a single π electron while a triple bond requires two π electrons. Atoms which participate in at least one delocalized bond are assumed to contribute only a single π electron. The π electron count is only considered when dealing with the chemical elements B, C, N, O, P and S.
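The π electron counting rule described above can be written down as a small Python function. This is a sketch of the rule as worded in the text rather than the OpenChemLib code. The function signature is hypothetical, and returning 0 for elements outside the listed set follows the convention used for the pi count columns of Additional file 2.

```python
PI_ELEMENTS = {"B", "C", "N", "O", "P", "S"}

# π electrons an atom contributes to a bond of the given formal order
PI_PER_BOND_ORDER = {1: 0, 2: 1, 3: 2}


def pi_electron_count(element, bond_orders, has_delocalized_bond=False):
    """Return the π electron count of one atom.

    element              -- chemical element symbol of the atom
    bond_orders          -- formal orders (1, 2 or 3) of the bonds of the atom
    has_delocalized_bond -- True if any bond of the atom is delocalized
    """
    if element not in PI_ELEMENTS:
        return 0
    if has_delocalized_bond:
        # atoms in a delocalized bond contribute a single π electron
        return 1
    return sum(PI_PER_BOND_ORDER[order] for order in bond_orders)
```

For example, a carbonyl carbon with one double and two single bonds yields 1, a carbon with one triple and one single bond yields 2, and any atom of a pyridine ring yields 1.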
Additional file 2 contains the bond length distribution set used in this work expressed in a tab-separated value format with the following columns:
bond id. A unique identifier of the bond class that also encodes the bond properties in a compact way.
bond length. The mean of the bond length distribution of the given bond class in ångströms.
bond std. The standard deviation of the bond length distribution of the given bond class in ångströms.
bond count. The number of bond length observations used to calculate the distribution.
bond type. An alphanumeric string that identifies a combination of bond order and bond flag values. Values "1", "2" and "3" denote a single, double, and triple covalent bond, respectively. Values "a1" and "a2" denote a single and a double aromatic bond, respectively. Value "d" denotes a delocalized bond.
atom 1 symbol. The chemical element symbol of bond atom a1.
atom 2 symbol. The chemical element symbol of bond atom a2.
atom 1 pi count. The π electron count of bond atom a1. This number is set to 0 if the chemical element is not one of B, C, N, O, P or S.
atom 2 pi count. The π electron count of bond atom a2. This number is set to 0 if the chemical element is not one of B, C, N, O, P or S.
The same information in the cif-perceive-chemistry software package is encoded as a machine-readable resources/bondLengthData.txt file that is directly interpretable by the OpenChemLib framework. Note, however, that since a custom approach was used to derive the bond length distributions (see Section S3), the values provided in this file differ from those given in the default bond length distribution set file distributed as part of OpenChemLib.
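A table of this kind can be consumed with a few lines of Python, for example to express an observed bond length as a deviation from the class mean in units of the standard deviation. The sketch below makes several assumptions: the header spellings (bond_id, bond_length, bond_std, bond_count) are guesses at how the columns are named in the actual file, the two data rows are placeholders rather than values from Additional file 2, and the helper names are hypothetical.

```python
import csv
import io

# Placeholder table; header spellings and numbers are assumptions
SAMPLE_TSV = "\n".join([
    "\t".join(["bond_id", "bond_length", "bond_std", "bond_count"]),
    "\t".join(["class-A", "1.50", "0.02", "1000"]),
    "\t".join(["class-B", "1.20", "0.01", "500"]),
])


def load_bond_classes(tsv_text):
    """Map each bond class identifier to (mean, std, count)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["bond_id"]: (
            float(row["bond_length"]),
            float(row["bond_std"]),
            int(row["bond_count"]),
        )
        for row in reader
    }


def deviation_in_sigmas(observed, mean, std):
    """Absolute deviation of an observed length in standard deviations."""
    if std == 0:
        return 0.0 if observed == mean else float("inf")
    return abs(observed - mean) / std


classes = load_bond_classes(SAMPLE_TSV)
class_mean, class_std, _ = classes["class-A"]
deviation = deviation_in_sigmas(1.56, class_mean, class_std)
```

With the placeholder numbers, an observed length of 1.56 Å lies about three standard deviations from the mean of "class-A".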

S6 Available data formats

S6.1 Stoichiometric CIF file
In the scope of this work a stoichiometric CIF file is defined as a CIF file which explicitly lists all atoms of a stoichiometrically correct molecular ensemble instead of providing the conventional crystal structure description that lists atoms from the asymmetric unit. This type of file is usually generated from a conventional CIF file using the cif_molecule program with the "--one-datablock-output", "--preserve-stoichiometry", "--largest-molecule-only" and "--split-disorder-groups" command line options.
A stoichiometric CIF file fully conforms to the CIF file syntax, but reuses some of the data items from the CIF CORE dictionary [2] with slightly different semantics. The main differences include:
The ATOM_SITE category loop explicitly describes all atoms that make up the molecule instead of only the asymmetric unit atoms.
Data items dealing with the space group information (e.g., _space_group_symop_operation_xyz, _space_group_name_Hall) always describe the P 1 space group rather than the space group of the input crystal structure. Since the P 1 space group consists of a single x,y,z identity operation, no additional symmetrically equivalent atoms are generated even if the symmetry operations are reapplied to the atoms described in the ATOM_SITE category loop.
The _cell_formula_units_Z data item is always set to "1".
While this type of informal redefinition of the semantics may be viewed as a drawback, the reuse of existing data items makes the files readily interpretable by various external pieces of software such as Jmol or obabel. Furthermore, there are ongoing IUCr discussions [3] on the introduction of the _audit.formalism data item that would allow such reuse cases to be described more explicitly. Finally, files produced by the cif_molecule program can normally be reliably identified by examining the contents of the _audit_creation_method data item.
A stoichiometric CIF file may also contain a set of COD data items that record various additional properties such as the identified presence of a polymeric molecule (_cod_molecule_is_polymer) or the original space group of the input crystal structure (_cod_molecule_space_group_IT_number). Definitions of these data items are provided in the CIF COD [4] and CIF COD MOLECULE [5] dictionaries.

S6.2 DWAR file
COD molecule descriptions are also distributed as a single DWAR file [6]. The DWAR file consists of an XML-like header that is intended to be interpreted by the open-source DataWarrior program, followed by a data table. The data table adheres to the formatting rules of a tab-separated value file and consists of a header row followed by multiple data rows, each of which describes a separate COD entry. An example DWAR file is provided as Additional file 4. It has the same structure as a regular COD DWAR file, but describes only the 4 crystal structures that were explicitly referenced in the main publication instead of the entire COD dataset.
DWAR files distributed by the COD contain the following data columns:
Structure. A machine-readable idcode text string that encodes the molecular structure in a canonical and compact way. Generated by and intended to be interpreted by the OpenChemLib framework.
idcoordinates3D. A machine-readable idcoordinates3D text string that encodes the 3D atomic coordinates of the molecular structure provided in the Structure field. Generated by and intended to be interpreted by the OpenChemLib framework.
FragFp. A machine-readable text string that encodes the FragFp binary fingerprint of the molecular structure provided in the Structure field. Relies on a dictionary of 512 predefined structure fragments. Generated by and intended to be interpreted by the OpenChemLib framework. For more information, see the "Similarity & Descriptors" section of the DataWarrior user manual [7].
SkelSpheres. A machine-readable text string that encodes the SkelSpheres descriptor of the molecular structure provided in the Structure field. Generated by and intended to be interpreted by the OpenChemLib framework. For more information, see the "Similarity & Descriptors" section of the DataWarrior user manual [7].
Source file. The name of the original COD CIF file that was used to generate the molecule description. Derived from the value of the _cod_data_source_file data item provided in the input CIF file. Used for data provenance purposes.
Source block. The code of the data block within the original COD CIF file that was used to generate the molecule description. Derived from the value of the _cod_data_source_block data item provided in the input CIF file. Used for data provenance purposes.
Has attached hydrogen atoms. A "yes"/"no" flag value which indicates if any of the atoms in the input CIF were marked as having attached hydrogen atoms instead of providing explicit coordinates of those hydrogen atoms. Note that molecule descriptions generated from CIF files marked with the "yes" value will contain at least some hydrogen atoms without explicit 3D coordinates. For more information on the concept of attached hydrogen atoms, see the description of the _atom_site_attached_hydrogens data item from the CIF CORE dictionary.
Space group IT number. The number of the space group that the molecule crystallised in. This number is derived from symmetry information provided in the input crystallographic file and follows the conventions described in the International Tables for Crystallography, Volume A. Among other applications, the space group number can be used to identify whether certain chiral molecular entities represent a single enantiomer or a racemate in the processed crystal structure, as discussed in the "Restoration of stoichiometrically correct molecular ensembles" section of the main publication.
Authors. A list of authors of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _publ_author_name data item provided in the input CIF file. The name syntax used follows the BibTeX convention, which is slightly different from the convention described by the IUCr in the definition of the _publ_author_name data item.
Title. The title of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _publ_section_title data item provided in the input CIF file.
Journal. The name of the journal in which the peer-reviewed publication that describes the crystal structure was published. Derived from the value of the _journal_name_full data item provided in the input CIF file.
Year. The publication year of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_year data item provided in the input CIF file.
Volume. The journal volume of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_volume data item provided in the input CIF file.
Issue. The journal issue of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_issue data item provided in the input CIF file.
First Page. The first page of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_page_first data item provided in the input CIF file.
Last Page. The last page of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_page_last data item provided in the input CIF file.
DOI. The DOI of the peer-reviewed publication that describes the crystal structure. Derived from the value of the _journal_paper_doi data item provided in the input CIF file.
Method. The method which was used to determine the crystal structure as identified using a set of heuristics. Takes one of the values from the ["single crystal", "powder diffraction", "theoretical"] enumerated set. For more information on how this value is determined, see the description of the "method" field in the COD SQL database description [8].
Radiation. The type of radiation which was used to determine the crystal structure. Derived from the value of the _diffrn_radiation_probe data item provided in the input CIF file.
Wavelength. The wavelength of the radiation which was used to determine the crystal structure in ångströms. Derived from the value of the _diffrn_radiation_wavelength data item provided in the input CIF file.
R-factor all. The residual factor for all reflections satisfying the resolution limits. For more information on how this value is determined, see the description of the "Rall" field in the COD SQL database description [8].
Cell length a. The lattice parameter a of the crystal structure in ångströms. Derived from the value of the _cell_length_a data item provided in the input CIF file.
Cell length b. The lattice parameter b of the crystal structure in ångströms. Derived from the value of the _cell_length_b data item provided in the input CIF file.
Cell length c. The lattice parameter c of the crystal structure in ångströms. Derived from the value of the _cell_length_c data item provided in the input CIF file.
Cell angle alpha. The lattice parameter alpha of the crystal structure in degrees of arc. Derived from the value of the _cell_angle_alpha data item provided in the input CIF file.
Cell angle beta. The lattice parameter beta of the crystal structure in degrees of arc. Derived from the value of the _cell_angle_beta data item provided in the input CIF file.
Cell angle gamma. The lattice parameter gamma of the crystal structure in degrees of arc. Derived from the value of the _cell_angle_gamma data item provided in the input CIF file.
Cell volume. The volume of the crystal lattice calculated from the lattice parameters in cubic ångströms.
Space group H-M. The space group symbol of the crystal structure as described by Hermann-Mauguin. May be replaced by a superspace group symbol if one is explicitly defined in the input CIF file. For more information on how this value is determined, see the description of the "sg" field in the COD SQL database description [8].
Space group Hall. The space group symbol of the crystal structure as described by Hall. Derived from the value of the _space_group_name_Hall data item provided in the input CIF file.
Is valid entry. A "yes"/"no" flag value which indicates if an entry successfully passed all of the COD data quality tests.

S6.3 SDF files
The following data items may appear in the SDF files [9] distributed by the COD:
COD_SDF_CIF_SVN_REVISION. Revision number of the input CIF file in the COD Subversion repository.
COD_SDF_DATA_SOURCE_FILE. Name of the input CIF file.
COD_SDF_DATA_SOURCE_BLOCK. Name of the CIF data block from the input CIF file.
COD_SDF_CREATION_TIMESTAMP. Timestamp recorded at the end of the SDF file creation.
COD_SDF_SOFTWARE_PACKAGE_NAME. Name of the software package that was used to create the SDF file.
COD_SDF_SOFTWARE_PACKAGE_VERSION. Version of the software package that was used to create the SDF file.
COD_SDF_SPACE_GROUP_IT_NUMBER. The number of the space group that the molecule crystallised in. This number is derived from symmetry information provided in the input crystallographic file and follows the conventions described in the International Tables for Crystallography, Volume A [10]. Among other applications, the space group number can be used to identify whether certain chiral molecular entities represent a single enantiomer or a racemate in the processed crystal structure, as discussed in the "Restoration of stoichiometrically correct molecular ensembles" section of the main publication.
COD_SDF_STRUCTURE_HAS_ATTACHED_HYDROGENS. A flag value that indicates if any of the atoms in the input CIF were marked as having attached hydrogen atoms instead of providing explicit coordinates of those hydrogen atoms. Enumeration values:
yes. The input CIF file was marked as having attached hydrogen atoms, therefore some hydrogen atoms will not be represented as distinct atoms in the SDF file.
no. The input CIF file was not marked as having attached hydrogen atoms, therefore all hydrogen atoms will be represented as distinct atoms in the SDF file. This is the default value and is thus often omitted.
COD_SDF_ATTACHED_HYDROGEN_ATOMS. A multiline list that records the number of attached hydrogen atoms that were assigned to specific atoms. Each line in the list consists of an atom index and a corresponding number of attached hydrogen atoms separated by a single space symbol.
COD_SDF_ISSUES. A summary of discrepancies detected by the chemical structure validation tests described in the "Chemical structure validation" section of the main publication. The validation results are recorded as a single line where individual issues are separated by a single space symbol and each issue is expressed as a compact ASCII text string that follows an internal shorthand notation. An empty line indicates that no issues were detected.
COD_SDF_VALIDITY_STATUS. A flag value that indicates if the original COD entry adheres to a set of additional quality criteria such as the presence of sufficient bibliographic information, a match between the generated and calculated chemical formulae, etc. Enumeration values:
1. Entry successfully passed all the quality criteria tests.
0. Entry failed at least one of the quality criteria tests or the generated molecule description contains some deviations from the expected results (see the COD_SDF_ISSUES data item).
PUBCHEM_EXT_DATASOURCE_REGID. Unique identifier of the input COD entry (COD ID).
PUBCHEM_EXT_SUBSTANCE_URL. URL of the input COD entry.
PUBCHEM_SUBSTANCE_COMMENT. Bibliographic reference to the original publication that describes the input crystal structure.
PUBCHEM_SUBSTANCE_SYNONYM. Chemical name of the substance observed in the crystal as extracted from the _chemical_name_systematic and _chemical_name_common CIF data items of the input CIF file. May contain several alternative names, one name per line.
PubChem data items are used with the permission of PubChem maintainers and do not deviate from the original intent expressed in the PubChem documentation [11,12].
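The data items listed in this section can be read back from an SDF record with a small amount of Python. The sketch below relies only on the general SDF convention that a data item starts with a "> <TAG>" header line and ends with a blank line. The underscore-separated tag spellings and the sample record are assumptions made for illustration, and the function names are hypothetical.

```python
def read_sdf_data_items(sdf_text):
    """Collect the data items of a single SDF record as {tag: [value lines]}."""
    items = {}
    tag = None
    for line in sdf_text.splitlines():
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # header line of the form "> <TAG>"
            tag = line[line.index("<") + 1:line.rindex(">")]
            items[tag] = []
        elif line.strip() == "" or line.startswith("$$$$"):
            tag = None  # a blank line or record delimiter ends the item
        elif tag is not None:
            items[tag].append(line)
    return items


def parse_attached_hydrogens(lines):
    """Turn 'atom_index count' lines into an {atom_index: count} mapping."""
    counts = {}
    for line in lines:
        atom_index, count = line.split()
        counts[int(atom_index)] = int(count)
    return counts


# Made-up fragment of an SDF record (molecule block omitted)
SAMPLE_RECORD = "\n".join([
    "M  END",
    "> <COD_SDF_VALIDITY_STATUS>",
    "1",
    "",
    "> <COD_SDF_ATTACHED_HYDROGEN_ATOMS>",
    "3 2",
    "7 1",
    "",
    "$$$$",
])

items = read_sdf_data_items(SAMPLE_RECORD)
hydrogens = parse_attached_hydrogens(items["COD_SDF_ATTACHED_HYDROGEN_ATOMS"])
```

In this made-up record, atom 3 carries two attached hydrogen atoms and atom 7 carries one. A data item whose value is an empty line, such as an issue list with no detected issues, is returned as an empty list.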

S7 Data retrieval instructions
The generated molecular descriptions are distributed under the CC0 licence and can be retrieved in several ways. Data retrieval examples provided in this section were tested using a Bash shell under the Ubuntu 22.04 GNU/Linux system. For alternative ways to engage with these data, please visit https://molecules.crystallography.net.

S7.1 Retrieval of SDF files
The SDF files are organised in a way that is very similar to the way CIF files are organised in the main COD repository. Each SDF file is assigned a filename which corresponds to the 7-digit COD ID of the input COD entry and is placed in a directory tree location determined from the first 5 digits of the COD ID. For example, the SDF file generated from COD entry 2231955 is assigned the 2231955.sdf filename and placed in the 2/23/19/ directory. The described SDF file layout is used by the following endpoints that can be used to retrieve the data:
rsync://molecules.crystallography.net/sdf. Accessible using the rsync protocol. This is the recommended method for downloading and updating the data when the entire SDF dataset is required. Retrieval of individual files using this method is also possible.
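The mapping from a COD ID to the location of its SDF file can be expressed as a short Python helper. This is a sketch of the layout rule described above. The function name is hypothetical, and the returned path is relative to the root of whichever endpoint is used.

```python
def sdf_relative_path(cod_id):
    """Return the relative path of the SDF file of a 7-digit COD ID."""
    text = str(cod_id)
    if len(text) != 7 or not text.isdigit():
        raise ValueError(f"not a 7-digit COD ID: {cod_id!r}")
    # directory levels are built from digits 1, 2-3 and 4-5 of the COD ID
    return f"{text[0]}/{text[1:3]}/{text[3:5]}/{text}.sdf"


example_path = sdf_relative_path(2231955)
```

For COD entry 2231955 the helper yields 2/23/19/2231955.sdf, matching the example given in the text.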

S7.2 Retrieval of the DWAR file
The dataset of all successfully processed COD entries is also distributed as a single DWAR file (see Section S6.2). The DWAR file can be retrieved from the following endpoints:
rsync://molecules.crystallography.net/dwar. Accessible using the rsync protocol.
https://molecules.crystallography.net/cod/dwar/COD.dwar. Accessible using the https protocol, e.g. via curl or wget. Accessing this endpoint using a general-purpose web browser such as Firefox should be done with caution since some browsers may try to display the entire ≈500 MB file instead of initiating a file download.

Usage examples:
Download the directory with the COD.dwar file using rsync: rsync -avz rsync://molecules.crystallography.net/dwar/dwar
The DWAR file can also be downloaded using curl via the https endpoint listed above.

The open-source DataWarrior program offers a wide variety of functionality for the analysis of chemical data. This section provides an example of how the DWAR file generated from the COD crystallographic data by the described workflow can be used for the detection of polymorphs. The provided example was tested on the Ubuntu 20.04 GNU/Linux system using version v05.05.00 of DataWarrior downloaded directly from the developers' website (https://openmolecules.org/datawarrior/download.html). Note that DataWarrior installers for Windows and macOS systems are also available.
Steps to search the COD database for polymorphs of sulfamerazine:
1. Download and install DataWarrior as specified in https://openmolecules.org/datawarrior/download.html.
2. Download the COD DWAR data file following the instructions provided in Section S7.2 and rename it to COD.dwar if needed.
3. Start DataWarrior and open the COD.dwar file. Click File → Open in the top toolbar and select the file. This should open a window similar to the one displayed in Figure S1. Note that the DataWarrior window is highly customisable, thus undesired panels may be closed while the more relevant ones may be moved or resized.
4. Apply search filters to exclude entries with undesired features. By default, all of the automatically generated filters will be located in the scrollable panel on the right side of the screen (see Figure S2). The names of the filters correspond to the names of the data table columns to which they apply (see Section S6.2). Exclude all theoretical crystal structures by locating the Method section and removing the check mark from the Theoretical field. Exclude all structures that did not pass all of the COD data quality checks by locating the Is valid field and removing the check mark from the No field.
5. Input the SMILES string of sulfamerazine into the structure search field. This should draw the structural formula of sulfamerazine in the search field and automatically filter out all entries that do not fit this filter criterion (see Figure S3). Note that the structure search field can be populated by multiple alternative methods (e.g. pasting an MDL Molfile, manually drawing the chemical fragment).
6. Further refine the structure search to match only sulfamerazine and not its derivatives or similar molecular entities. In the structure search field click on the "Is similar to [FragFP]" field and change it to "Is similar to [SkelSpheres]". Locate a slider on the left side of the chemical structure that goes from 0 to 1 and move it all the way up to 1 (see Figure S4).
7. Click on the Space Group IT number column name to sort the remaining fields by the space group number.
8. Inspect the remaining data using the method of your choice to locate polymorphs, for example:
(a) Manually locate entries that have different space group numbers (recorded in the Space Group IT number column) or entries that have the same space group number, but significantly different cell lattice parameters (recorded in the a, b, c, alpha, beta, gamma columns).
(b) Alternatively, automate the final analysis by exporting the selected entries into a separate TSV or SDF file and post-processing them using custom external programs. Create a new subset by clicking File → New From → Visible rows in the top toolbar. This should open a new DataWarrior window that can be manipulated independently from the original one. In the newly created window click File → Save Special and select either Textfile... or SD-File....

Caveats of the described method:
The list of selected structures may also include some racemic crystal structures. However, these can be easily recognised and filtered out using the value of the Space Group IT number column (see column description in Section S6.2).

Further data columns of the COD DWAR files (see Section S6.2):
Software package name. The name of the program that generated the molecule description. Used for data provenance purposes.
Software package version. The version string of the program that generated the molecule description. Used for data provenance purposes.
Creation timestamp. The molecule description creation timestamp in ISO 8601 format. Used for data provenance purposes.
C1-Problems. A summary of discrepancies detected by the chemical structure validation tests described in the "Chemical structure validation" section of the main publication. The validation results are recorded as a single line where individual issues are separated by a single space symbol and each issue is expressed as a compact ASCII text string that follows an internal shorthand notation. An empty string indicates that no issues were detected.
COD Number. A persistent unique identifier of the original input COD entry. Also known as a COD ID.
Substance Name. The trivial name by which the compound is commonly known. Derived from the value of the _chemical_name_common data item provided in the input CIF file.
Chemical Name. IUPAC or Chemical Abstracts full name of the compound. Derived from the value of the _chemical_name_systematic data item provided in the input CIF file.

Figure S1: DataWarrior window upon the initial loading of the COD.dwar file.
Figure S2: DataWarrior window after filtering out all marked theoretical structures. The scrollable panel that contains the filter fields is marked using red lines.
Figure S3: DataWarrior window after inputting the SMILES string of sulfamerazine into the structure search field. The structure search field is marked using red lines.
Figure S4: DataWarrior window after further adjusting the structure search parameters. The similarity type selection field and the similarity slider are marked using red lines.
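The automated variant of the final inspection, step 8 (b), could be post-processed along the lines of the Python sketch below. It groups exported rows into candidate crystal forms using the criteria of step 8 (a): entries fall into the same group only if they share a space group number and have similar cell parameters. Several assumptions are made: the column headers follow the names given in Section S6.2 and may be spelled differently in an actual export, the 2 % relative tolerance is an arbitrary choice, the sample rows are placeholders rather than COD data, and racemic structures are not treated specially.

```python
import csv
import io

CELL_COLUMNS = [
    "Cell length a", "Cell length b", "Cell length c",
    "Cell angle alpha", "Cell angle beta", "Cell angle gamma",
]
HEADER = ["COD Number", "Space group IT number"] + CELL_COLUMNS

# Placeholder rows standing in for a TSV file exported from DataWarrior
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [
    HEADER,
    ["0000001", "61", "9.10", "21.90", "13.50", "90", "90", "90"],
    ["0000002", "61", "9.12", "21.95", "13.52", "90", "90", "90"],
    ["0000003", "33", "14.40", "21.90", "8.20", "90", "90", "90"],
])


def same_form(row_a, row_b, rel_tol=0.02):
    """True if two rows share a space group and have similar cells."""
    if row_a["Space group IT number"] != row_b["Space group IT number"]:
        return False
    for column in CELL_COLUMNS:
        a, b = float(row_a[column]), float(row_b[column])
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            return False
    return True


def group_candidate_forms(rows, rel_tol=0.02):
    """Greedily group rows; each group is one candidate crystal form."""
    groups = []
    for row in rows:
        for group in groups:
            if same_form(group[0], row, rel_tol):
                group.append(row)
                break
        else:
            groups.append([row])
    return groups


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
forms = group_candidate_forms(rows)
form_ids = [[row["COD Number"] for row in group] for group in forms]
```

With the placeholder rows, the first two entries are grouped together as redeterminations of one form, while the third entry forms a second group and is therefore a polymorph candidate. More than one resulting group suggests polymorphism, subject to the caveats listed above.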